Abstract

Most trainable machine translation (MT) metrics train their weights on human judgments of the outputs of state-of-the-art MT systems. This makes trainable metrics biased in many ways. One of these biases is a preference for longer translations.

When used for tuning, these biased metrics evaluate a different type of translations: n-best lists of translations of very diverse quality. Systems tuned with these metrics tend to produce overly long translations that are preferred by the metric but not by humans.

This is usually solved by manually tweaking the metric’s weights to value recall and precision equally. Our solution is more general: (1) it addresses not only the recall bias but also all other biases that might be present in the data, and (2) it does not require any knowledge of the types of features used, which is useful in cases where manual tuning of the metric’s weights is not possible.

This is accomplished by self-training on unlabeled n-best lists using a metric that was initially trained on standard human judgments. One way of looking at this is as domain adaptation from the domain of state-of-the-art MT translations to that of diverse n-best list translations.

1 Motivation 

Evaluation metrics that are used in Machine Translation (MT) are usually trained on human judgments of outputs from state-of-the-art MT systems that participate in competitions such as WMT. Humans often prefer longer translations over shorter ones: they would rather have additional, potentially wrong information that they can disambiguate than miss some information.

Training metrics on human judgments that prefer longer translations makes the metrics give more importance to recall than to precision. While this might be the right decision for the metrics task, it can be very wrong in other applications of evaluation metrics, such as tuning.

If an MT system is tuned with a metric that prefers recall over precision, that system will in the end have a low word penalty and produce very long translations.
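The effect of a recall-heavy weighting can be illustrated with a small sketch. It is not BEER's feature set: it assumes plain unigram overlap and a weighted harmonic mean of precision and recall, and the parameter `alpha`, the function names and the example sentences are all hypothetical choices made for illustration only.

```python
from collections import Counter


def precision_recall(hypothesis, reference):
    """Unigram precision and recall of a hypothesis against one reference."""
    hyp, ref = hypothesis.split(), reference.split()
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    return overlap / len(hyp), overlap / len(ref)


def weighted_f(precision, recall, alpha):
    """Weighted harmonic mean; alpha close to 1 puts the weight on recall."""
    if precision == 0 or recall == 0:
        return 0.0
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


# Toy data (assumed): one reference, one short and one overly long hypothesis.
reference = "the cat sat on the mat"
short_hyp = "the cat sat on mat"  # misses one reference word
long_hyp = "the cat sat on the mat in the big house today"  # adds unsupported words

for alpha in (0.5, 0.9):
    short_score = weighted_f(*precision_recall(short_hyp, reference), alpha)
    long_score = weighted_f(*precision_recall(long_hyp, reference), alpha)
    print(f"alpha={alpha}: short={short_score:.3f} long={long_score:.3f}")
```

With the balanced setting (`alpha=0.5`) the short hypothesis scores higher, while with the recall-heavy setting (`alpha=0.9`) the long hypothesis wins even though almost half of its words are unsupported by the reference. A tuner that optimizes the second scoring function is therefore pushed toward longer output.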

The reason for this is that the translations evaluated during tuning are of very different quality (quality that is far from state-of-the-art MT output). Having a metric trained on one domain (state-of-the-art MT output) and used on another (a sample of the search space of the MT decoder) creates a mismatch that is very harmful for tuning.

We look at this as a problem similar to domain adaptation and apply one of the simplest techniques that exist for domain adaptation: self-training (Abney, 2007; Søgaard, 2013).

We train our metric BEER in a standard way on human judgments of WMT 13 and WMT 14 data using the learning-to-rank methods presented in Stanojevic and Sima’an (2014).

For self-training we collect n-best lists on WMT 12 test data and then sample pairs of translations (tuples of first hypothesis, second hypothesis and reference). The initial metric decides which of the two hypotheses is more likely to be the better translation and which the worse one. After we create many of these automatically ranked pairs, we treat them as if they were ranked by humans and train our metric again. This process can be repeated for several iterations, but we do only one.
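The self-training loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not BEER's actual implementation: the metric is assumed to be a linear model over precomputed feature vectors (one vector per hypothesis-reference pair), the pairwise learner is a simple perceptron standing in for the learning-to-rank method, and the retraining step uses only the automatically ranked pairs, since the text does not say whether the human-ranked pairs are kept. All function names are hypothetical.

```python
import random


def score(weights, features):
    """Linear metric score: dot product of weights and a feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def train_pairwise(pairs, dim, epochs=10, lr=0.1):
    """Perceptron-style ranker; each pair is (better_features, worse_features)."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            # Update only when the current weights rank the pair wrongly.
            if score(weights, better) <= score(weights, worse):
                for i in range(dim):
                    weights[i] += lr * (better[i] - worse[i])
    return weights


def self_label(weights, nbest_features, num_pairs, rng):
    """Sample hypothesis pairs from n-best lists and rank them with the metric."""
    pairs = []
    for _ in range(num_pairs):
        nbest = rng.choice(nbest_features)  # one n-best list
        first, second = rng.sample(nbest, 2)  # two distinct hypotheses
        if score(weights, first) >= score(weights, second):
            pairs.append((first, second))
        else:
            pairs.append((second, first))
    return pairs


def self_train(human_pairs, nbest_features, dim, num_pairs=100, seed=0):
    """One self-training iteration: train, auto-label, then train again."""
    rng = random.Random(seed)
    initial = train_pairwise(human_pairs, dim)
    auto_pairs = self_label(initial, nbest_features, num_pairs, rng)
    # Assumption: retrain on the automatically ranked pairs only.
    adapted = train_pairwise(auto_pairs, dim)
    return initial, adapted
```

A call such as `self_train(human_pairs, nbest_features, dim=2)` returns both the initial and the adapted weight vectors, so the two metrics can be compared on the same n-best lists. Repeating the process for more iterations would amount to feeding the adapted weights back into `self_label`.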


2 Experimental results 

In Table 1 we show the results of tuning on WMT 14 data and testing on WMT 13 data.

tuning metric    BLEU    MTR     BEER    Length
BEER             16.4    28.4    10.2    115.7
BLEU             18.2    28.1    10.1    103.0
BEER_no_bias     18.0    27.7     9.8     99.7

Table 1: Tuning results with BEER without bias, using WMT14 as the tuning set and WMT13 as the test set

Before the automatic adaptation of weights for tuning, tuning with standard BEER produces translations that are 15% longer than the reference translations. This behavior is rewarded by recall-heavy metrics like METEOR and BEER and punished by precision-heavy metrics like BLEU. After automatic adaptation of weights, tuning with BEER matches the length of the reference translations even better than tuning with BLEU does, and achieves a BLEU score that is very close to that of tuning with BLEU. This kind of model is disliked by METEOR and BEER, but by just looking at the length of the produced translations it is clear which approach is preferred.

In Table 2 we can see the results of the WMT15 tuning task. The baseline is the best tuning system, but among all systems submitted to the task (other than the baseline) BEER without bias is the one most preferred by humans.

system                  human score
bleu-MIRA-dense          0.159
ILLC-UvA                 0.108
AFRL                     0.081
bleu-MERT                0.075
USAAR-Tuna-Saarland      0.013
DCU                     -0.01
METEOR-CMU              -0.095
bleu-MIRA-sparse        -0.139
HKUST-MEANT             -0.192

Table 2: Preliminary tuning task results (August 4th 2015) for Czech-English; self-trained BEER is named ILLC-UvA

3 Conclusion

Trainable MT metrics have the problem of being trained on very biased data. Previous work usually concentrated on the recall bias, which was corrected by manually setting equal weights for recall and precision (Denkowski and Lavie, 2011; He and Way, 2009).

In this paper we addressed this problem from a more general perspective that:

• tries to remove any bias (not only recall bias)

• repairs bias in models with a large number of features in which manual weight tuning is not possible

This allows us to have more freedom in choosing the features of the metric without worrying whether they would bias the learner in the wrong direction. This type of metric with smaller bias is preferable for tuning, which is confirmed by the WMT15 tuning task.








